Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2

The de Bruijn graph is a key data structure in modern computational genomics, and construction of its compacted variant resides upstream of many genomic analyses. As the quantity of genomic data grows rapidly, this often forms a computational bottleneck. We present Cuttlefish 2, significantly advancing the state-of-the-art for this problem. On a commodity server, it reduces the graph construction time for 661K bacterial genomes, of size 2.58Tbp, from 4.5 days to 17–23 h; and it constructs the graph for 1.52Tbp white spruce reads in approximately 10 h, while the closest competitor requires 54–58 h, using considerably more memory. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-022-02743-6).


Choice of frequency thresholds
The frequency threshold f 0 of k-mers ((k + 1)-mers in case of Cuttlefish 2) for the algorithms when working with sequencing data was approximated so as to roughly minimize the misclassification rates of weak and solid k-mers in these experiments. This was performed based on approximate frequency distributions of the k-mer frequencies themselves, computed using the ntCard tool [84]. The heuristic setting-policy of f 0 is inspired from observations by Zhao et al [65]: the frequency distribution of erroneous k-mers tend to diminish exponentially, whereas that of error-free kmers typically follow a normal distribution; and the intersecting point of these density functions can be a reasonable choice for f 0 , which we approximated with ntCard [84]. Suppl. Fig. S1 shows some of these approximate distributions.
1.2 Compacted graph construction for sequencing data Suppl. Table S1 contains the performance results of the evaluated tools for compacted de Bruijn graph construction from sequencing data.
Suppl. Table S2 shows the performance results on the human read set with a frequency cutoff of f 0 = 2.

Compacted graph construction for reference
collections Suppl. Table S3 shows the performance results of the evaluated tools for compacted de Bruijn graph construction from reference sequence collections.
1.4 Timing-profile without (k + 1)-mer (or k-mer) enumeration We tested the hypothesis of whether having a uniform k-mer enumerator for Cuttlefish 2 and BCALM 2 might significantly impact their performance difference. Suppl. Table S4 demonstrates the timing-profile 1.5 Validation of the compacted de Bruijn graphs The validation of a compacted de Bruijn graph consists of checking three aspects of the graph: (1) completeness: whether the set of maximal unitigs contain all the k-mers from the original de Bruijn graph; (2) maximality: whether the output unitigs are actually maximal. and (3) branch-freeness: that the complete, maximal cover of the de Bruijn graph contains no paths having internal vertices than branch in the underlying de Bruijn graph.
Theoretically, the Cuttlefish 2 algorithm obtains all three of these criteria. Completeness is obtained trivially, and maximality and branch-freeness are obtained as per Theorem 1. We cross-checked the correctness of the actual implementation by validating the output graphs of Cuttlefish 2 against those of BCALM 2 and Bifrost. In doing so, we observed some small-scale differences in both the k-mer-content (completeness) and unitig-content (maximality). We provide some informal reasoning for these differences here:  Figure S1: Frequency distribution of k-mer abundances: (a) for the human read set NIST HG004 with k = 27, and (b) for the white spruce read set NCBI PRJNA83435 with k = 55. Densities for the frequencies 1 and 2 have been omitted from the plots, as those dwarf the other frequency densities and skew the plots drastically.

Vertex-centric versus Edge-centric de Bruijn graphs
For a given dataset, let its k-mer set be V and (k + 1)-mer set be E. Consider its vertex-centric de Bruijn graph G V and its edge-centric de Bruijn graph G E . For ease of exposition, assume that there is no k-mer (or (k + 1)-mer) filtering performed, and that the graphs do not have tips 2 , which consist of a negligible number of vertices for real datasets. Both the graphs have the same vertex set V.
By definition, any edge e 1 in G E must also exist in G V . But this does not necessarily hold true in the opposite direction: some edge e = {u, v} in G V could be Each cell contains the running time in wall clock format, and the maximum memory usage in gigabytes, in parentheses. Details on executing the different tool implementations can be found in Table 1 (See main text). The best performance with respect to each metric in each row is highlighted, where only the default-memory mode is considered for Cuttlefish 2. The ∆'s in the ABySS-Bloom-dBG results denote that the corresponding executions were allowed to run for at least 2 days, before being explicitly terminated. 22h 44m (59.9 | 2047) 22h 20m (129.5 | 1974) Each cell contains the running time in wall clock format, and in parentheses: the maximum memory usage and the maximum intermediate disk-usage separated by |, in gigabytes. All the execution details and other relevant information can be found in Table 2 (see main text). absent in G E -although there exists a (k − 1)-length overlap between the k-mers u and v, the (k + 1)-mer u k−1 v could be absent in E. 3 Thus G V must always have an equal or greater number of edges than G E . As a result, G V contains an equal or larger number of branching, i.e. unitig-flanking vertices than G E . 4 This reduces the number of maximal unitigs reported 3 For clarity, we are not considering the k-mer orientations. 4 It could be possible that some of these additional edges in G V connect two separate maximal unitigs into one, thus actually reducing branching vertices. The assumption that G E contains no tips prevents this-there can not exist an edge {x, y} in G V such that, x and y are in in edge-centric de Bruijn graphs compared to vertexcentric ones.
Vertex-filtering versus Edge-filtering Related to the above point, since the fundamental units of de Bruijn graph construction in Cuttlefish 2 are the edges, this is where error-filtering is performed prior to construction. Conversely, BCALM 2 and Bifrost take a vertex-centric approach to construction and hence filtering is performed on the vertex set. In some corner cases, where unit-abundances are very close to the selected threshold, this can lead tip-ends in G E , and connect the two tips they belong to into one single maximal unitig in G V . Each cell contains the running time in wall clock format, excluding the times incurred by the initial: (a) k-mer enumeration step of BCALM 2, and (b) (k + 1)-mer enumeration step of Cuttlefish 2. All the execution details and other relevant information can be found in the Tables 1 and 2 (see main text).
to small-scale differences in which k-mers are filtered out prior to construction. Consider a given threshold f 0 . Any k-mer x present in the input at least f 0 times is a vertex in the vertexcentric graph. But there may not exist any (k + 1)-mer z in the input that occurs at least f 0 times and contains x, thus x is absent as a vertex in the corresponding edge-centric graph. On the opposite direction, any kmer y, substring of a (k + 1)-mer z that is present in the input at least f 0 times, is a vertex in the edgecentric graph. It also implies that y has an abundance of at least f 0 in the input, and thus is also a vertex in the vertex-centric graph.
Therefore, when using a same frequency threshold f 0 for k-mers and (k + 1)-mers, vertex-centric de Bruijn graphs must always have an equal or greater number of vertices than edge-centric ones.
1.6 Compacted de Bruijn graph properties Suppl. Table S5 contains some notable characteristics of the original de Bruijn graphs and their compacted forms.

Maximal path cover construction
Suppl. Table S6 provides a comparison of the maximal unitig based and the maximal path cover based representations of the de Bruijn graphs.

Parallel scaling
Suppl. Fig. S2 demonstrates the timing-profile and speedup for each step of Cuttlefish 2, on the same setting as described in Sec. 2.7 (see main text), but with k = 55.

Application in associative k-mer index construction
The utility of Cuttlefish 2 and any compacted de Bruijn graph constructor depends upon the downstream applications for which it is used. In this proof of-concept section, we demonstrate the improvement provided by Cuttlefish 2 over alternative methods in a pipeline that constructs an associative k-mer index over a collection of reads or references. These indices, sometimes implicitly, form a fundamental component in various computational genomics tasks, such as in tools for variant detection and genotyping [13], RNA isoform quantification [86], large-scale sequence search [87], and k-mer abundance indexing [88].
Given a set V of k-mers, an associative k-mer index of V consists of a bijective mapping f : V → [0, |V|). It is different from a minimal perfect hash h : V → [0, |V|) in that, any alien k-mer v / ∈ V can be detected by f as absent in V, i.e. ∀v / ∈ V f(v) = −1; but not necessarily by h.
We investigated the overall performance difference for a pipeline that uses SSHash [54] to index the k-mer set-represented with the de Bruijn graph-of several  Given a de Bruijn graph G(R, k) = (V, E) and a representation of it P, the base/k-mer metric is computed as p∈P |p| / |V|, i.e. the number of nucleobase characters required in average per k-mer for the literal representation of the paths in P (maximal unitigs decomposition is also a path-cover). If the 2-bit/nucleobase encoding is used instead of the literal representations, then then the bits/k-mer requirement would be 1/4'th of the base/k-mer requirement.
datasets of various sizes. In the pipeline, we first extract the maximal path (or unitig) sequences from the graph, and then use SSHash to construct a k-mer index from these sequences. We compare the performance of this pipeline when using Cuttlefish 2 to extract the maximal path cover (or unitigs) versus using UST (or BCALM 2). Suppl. Table S7 provides a comparison of the performances. We observe that, for intermediate or large datasets, when using UST (or BCALM 2) to extract the maximal path cover (or unitigs), the extraction of the sequences itself is both the time and the memory bottleneck. That is, indexing the extracted sequences with SSHash is both faster and more memory frugal than extracting the sequences in the first place, often by a considerable factor. On the other hand, when we use Cuttlefish 2 to extract the sequences, the time taken to extract the sequences, and therefore the time taken to construct the entire index, is reduced dramatically. Additionally, the memory usage of Cuttlefish 2 is often comparable (or in the case of the Gut microbiome reads less than) that of SSHash. Thus, replacing UST (or BCALM 2) with Cuttlefish 2 in this index construction greatly reduces the bottleneck step of associative index construction, and sometimes even shifts the bottleneck itself from the task of extracting a maximal path cover (or unitigs) of the de Bruijn graph to the task of constructing the index from these sequences.   Each cell contains the running time in wall clock format, and the maximum memory usage in gigabytes, in parentheses. In all the executions, k is 27, and the maximal path cover and the maximal unitigs are the ones obtained from the experiments performed in Secs. 2.3, 2.4, and 2.5 (see main text). The dataset descriptions are also present in these sections. All the execution details and other relevant information can be found in Tables 1, 2, and 3 (see main text). The thread count used for UST and Cuttlefish 2 in the maximal path cover based construction is 8; for BCALM 2 and Cuttlefish 2 in the maximal unitigs based construction, it is 16. The SSHash implementation is single-threaded.
The following commands have been used in executing the tools.
• ABySS-Bloom-dBG: where ip_type_arg is either r or s, based on whether reference-sequences or short-reads are provided as input, respectively.
where the -t ${threads} argument has been added by us to control the number of processorcores for it to use-its default setting uses up-to all the available cores. • SSHash: cgexec -g memory: where the cgroup_name is a Linux control group, set with an appropriate memory limit to restrict SSHash's reported memory usage. This is necessary due to SSHash's use of the mmap system call, which results in the counting of unused shared pages toward the program's memory usage, particularly in a memory-rich execution environment where the program is not under memory pressure. As such, the time command will report a much higher memory usage than is actually required by SSHash to run. The cgroup execution places a hard limit on the memory the program can use, and applies the requisite memory pressure to ensure that the reported memory usage is much closer to what is actually required for successful execution. • Cuttlefish 2: In the following, the ${read_or_ref_arg} is either read or ref, based on whether referencesequences or short-reads are provided as input, respectively. -

Upgrades in the KMC 3 algorithm
We implemented several upgrades in the KMC 3 algorithm to tune it to the efficiency needs for Cuttlefish 2. Here we discuss those briefly. Although the upgrades were designed specifically for usage in Cuttlefish 2, those may also be suitable in other bioinformatics pipelines, and are publicly available in the KMC 3 GitHub repository (https://github.com/r efresh-bio/kmc).
2.1.1 Counting k-mers from existing KMC 3 database KMC 3 is updated so as to be able to count k -mers from a k-mer database produced by another KMC 3 execution, for some k < k. This allows reducing computational resources needed to determine the set of vertices, V, as it may be directly computed from the set of edges, E, without the need of an entire pass over all the input sequences. This is especially relevant in the case of sequencing reads. Technically, the KMC 3 API is used in the listing mode to enumerate all k-mers that are further processed as if they were reads.
2.1.2 Estimate k-mer abundance histogram during the first stage in KMC 3 This upgrade allows efficient estimation of the total number of unique k-mers present in the input during the first stage of KMC 3. The estimation is performed by our optimized implementation of the ntCard algorithm [84].

Using KMC 3 directly from C++ code with API
A new API to use KMC 3 directly from inside some C++ code is designed for Cuttlefish 2, and it is usable in general. Furthermore, it is possible to set parameters for the second stage of KMC 3 based on the results of the first stage. The detailed documentation of API is available in the KMC 3 GitHub repository: https:// github.com/refresh-bio/kmc/wiki/Use-the-KMCdirectly-from-code-through-the-API. Combining this with the capability to estimate k-mer abundance histograms, it is possible to bound the memory-usage of the second stage of KMC 3 such that it uses at most the peak amount of memory required in the next steps of Cuttlefish 2.

Storing k-mers without counts in KMC 3
databases This upgrade affects disk usage. To date, KMC 3 output required at least one byte per k-mer to store a counter. In some applications, e.g. to build the compacted de Bruijn graph without abundance estimates for the vertices, the counters are not required and can be skipped. In practice, this leads to the reduction of disk usage and, as a consequence, reduction in the total I/O costs, which in turn affects the running time.

Proofs
Lemma 1 The (k + 1)-mers z and z induce the same bidirected edge in a de Bruijn graph G(S, k).
Proof Consider a (k + 1)-mer z from some input string s ∈ S. Let x = pre k (z) and y = suf k (z). Then z can be expressed as z = x k−1 y.
z induces an edge between the vertices x and y. It is incident to the back of x when x = x holds, and is incident to the front when x = x (see Sec. 3.2, main text).
z's reverse complement is z = y k−1 x, and it induces an edge between y and x. It is incident to the front of x if x = x holds, and is incident to the back if x = x-the same side as that of z's edge.
It can be proven likewise that the edges are incident to the same side of y. Therefore, z and z induce edges between the same vertex-pair { x, y}, incident to the same sides-inducing the same bidirected edge.
Lemma 2 A side of a vertex can have at most |Σ| distinct edges in a de Bruijn graph G(S, k).
Proof Consider a vertex v in G(S, k). WLOG, we prove the claim for the back of v.
As per Lemma 1, the (k+1)-mers c· v and v·c induce the same bidirected edge, where c ∈ Σ. Thus E 1 and E 2 induce the same set of edges. Therefore, the back of v can have at most |E 1 | = |E 2 | = |Σ| distinct edges.
Lemma 3 A vertex v is noted to be a flanking vertex in a de Bruijn graph G(S, k) iff it is an endpoint of a maximal unitig.
Proof Let C v be the state-class of v's automaton and p be the maximal unitig containing v. The term branching in the proof means connecting to multiple distinct edges.
First, assume that v is marked as a flanking vertex. We prove that v is an endpoint of p. As per the definition of flanking vertices (see Sec. 3.3.8, main text), either of the following holds: 1. C v is not unique-front unique-back. Then from Corollary 1, v has at least one side s v with either 0 or > 1 distinct edges. It is not possible to extend p through s v -either there is no edge, or the addition introduces an internal branching vertex v in p. 2. C v is unique-front unique-back, and a side of it, s v , is connected to a branching side s u of a vertex u.
Then p can not be extended through s v , because the extension includes s u as an internal side to p, which is branching. In either case, v is an endpoint of p. Now assume that v is an endpoint of p. We prove that v is marked as a flanking vertex. Based on the adjacencies of v, either of the following holds: 1. v has at least one side s v that is either empty or branching. From Corollary 1, C v is not unique-front unique-back. 2. v has one unique edge at each side. Say that its side s v restricts p from extending farther, and s v connects to the side s u of a vertex u. The definition of unitigs implies that s u must be branching. This in turn implies from Corollary 1 that u's automaton's state is from the state-class: (i) either fuzzy-front fuzzy-back ; or (ii) fuzzy-front unique-back, in which case s u is front; or (iii) unique-front fuzzy-back, in which case s u is back. In either case, v fulfills the conditions for being a flanking vertex.